⚡️ Speed up function `get_code_fingerprint` by 20% in PR #733 (`deduplicate-better`) #735

codeflash-ai · 2025-09-13T23:54:24Z

⚡️ This pull request contains optimizations for PR #733

If you approve this dependent PR, these changes will be merged into the original PR branch `deduplicate-better`.

This PR will be automatically closed if the original PR is merged.

📄 20% (0.20x) speedup for `get_code_fingerprint` in `codeflash/code_utils/deduplicate_code.py`

⏱️ Runtime : 142 milliseconds → 118 milliseconds (best of 64 runs)
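
The headline figure can be reproduced from the two runtimes. The sketch below assumes speedup is defined as time saved divided by the optimized runtime, which is consistent with the numbers reported here; the function name `speedup` is illustrative and not part of Codeflash.

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup: time saved divided by the optimized runtime (assumed definition)."""
    return (original - optimized) / optimized


# Overall runtimes from this report, in milliseconds.
overall = speedup(142, 118)
print(f"{overall:.2f}x ({overall:.0%})")  # prints "0.20x (20%)"
```

Dividing by the original runtime instead would give about 17%, so the choice of denominator matters when comparing these figures with other benchmarks.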

📝 Explanation and details

The optimization achieves a **20% speedup** by **inlining the docstring removal logic** and replacing the expensive `ast.walk()` traversal with a more targeted iterative approach.

**Key changes:**

- **Eliminated function call overhead**: Removed the separate `remove_docstrings_from_ast()` function call, saving function invocation costs
- **Optimized AST traversal**: Replaced `ast.walk()` with an explicit stack-based traversal that only visits nodes that can contain docstrings (`FunctionDef`, `AsyncFunctionDef`, `ClassDef`, `Module`)
- **Reduced allocations**: The stack-based approach creates fewer temporary objects than `ast.walk()`, which queues every node of the tree in a deque and yields each one through a generator
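
A minimal sketch of the two traversal strategies, for illustration only. This is not the code from `deduplicate_code.py`; the helper names (`_is_docstring`, `strip_docstrings_walk`, `strip_docstrings_stack`) are hypothetical, and the targeted variant assumes definitions appear directly in a module, class, or function body.

```python
import ast

# Definition nodes whose first body statement may be a docstring.
_DEFS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def _is_docstring(stmt: ast.stmt) -> bool:
    """True if stmt is a bare string-literal expression statement."""
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and isinstance(stmt.value.value, str)
    )


def strip_docstrings_walk(tree: ast.Module) -> ast.Module:
    """Baseline: ast.walk() visits every node in the tree."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, *_DEFS)):
            if node.body and _is_docstring(node.body[0]):
                node.body = node.body[1:]
    return tree


def strip_docstrings_stack(tree: ast.Module) -> ast.Module:
    """Targeted: the stack only ever holds the module and definition nodes."""
    stack = [tree]
    while stack:
        node = stack.pop()
        body = node.body
        if body and _is_docstring(body[0]):
            body = node.body = body[1:]
        # Push only nested definitions; expressions and plain statements are skipped.
        stack.extend(child for child in body if isinstance(child, _DEFS))
    return tree
```

One trade-off of this sketch: a definition nested inside an `if`, `for`, `try`, or `with` block is not reached by the targeted variant, so a real implementation would need to decide whether to descend into those blocks as well.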

**Performance impact by test case type:**

- **Large-scale tests** (100 to 500 generated variables, functions, or methods) see the biggest gains: **roughly 17-25% faster**, since they benefit most from the reduced traversal overhead
- **Basic tests** with simple functions: **7-12% faster**, a modest but consistent improvement
- **Edge cases**: empty or comment-only code is **6-9% faster**; the syntax-error paths are 2-3% slower
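
Runtimes in this report are quoted as a best-of-N figure (best of 64 runs for the overall number). A rough way to take a comparable measurement locally is sketched below; `best_of` is a hypothetical helper, not a Codeflash API, and the workload shown is only a stand-in.

```python
import time


def best_of(func, *args, runs: int = 64) -> float:
    """Return the fastest wall-clock time in seconds over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace with the function under test.
elapsed = best_of(sum, range(1000), runs=5)
print(f"best of 5: {elapsed * 1e6:.1f} microseconds")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise, which matters for the microsecond-scale cases listed in the tests.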

The optimization specifically targets the docstring removal step, which previously consumed 24.4% of total runtime in the line profiler. By making the traversal more targeted and eliminating unnecessary node visits, the optimized version reduces this bottleneck while preserving identical functionality and output.

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 82 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import ast
import hashlib

# imports
import pytest  # used for our unit tests
from codeflash.code_utils.deduplicate_code import get_code_fingerprint

# unit tests

# 1. Basic Test Cases

def test_fingerprint_identical_code():
    # Identical code should have identical fingerprints
    code1 = "def foo(x):\n    return x + 1"
    code2 = "def foo(x):\n    return x + 1"
    codeflash_output = get_code_fingerprint(code1) # 129μs -> 120μs (7.88% faster)

def test_fingerprint_variable_names_differ():
    # Different variable names should yield same fingerprint
    code1 = "def foo(a):\n    b = a + 1\n    return b"
    code2 = "def foo(x):\n    y = x + 1\n    return y"
    codeflash_output = get_code_fingerprint(code1) # 149μs -> 135μs (10.4% faster)

def test_fingerprint_whitespace_and_comments():
    # Whitespace and comments should not affect fingerprint
    code1 = "def foo(x):\n    # comment\n    return x + 1"
    code2 = "def foo(x):\n    return x + 1"
    codeflash_output = get_code_fingerprint(code1) # 122μs -> 113μs (8.42% faster)

def test_fingerprint_docstrings_removed():
    # Docstrings should not affect fingerprint
    code1 = 'def foo(x):\n    """Docstring"""\n    return x + 1'
    code2 = "def foo(x):\n    return x + 1"
    codeflash_output = get_code_fingerprint(code1) # 128μs -> 116μs (10.9% faster)

def test_fingerprint_function_name_changes():
    # Changing function name should change fingerprint
    code1 = "def foo(x):\n    return x + 1"
    code2 = "def bar(x):\n    return x + 1"
    codeflash_output = get_code_fingerprint(code1) # 120μs -> 109μs (9.62% faster)

def test_fingerprint_class_name_changes():
    # Changing class name should change fingerprint
    code1 = "class Foo:\n    pass"
    code2 = "class Bar:\n    pass"
    codeflash_output = get_code_fingerprint(code1) # 66.6μs -> 62.1μs (7.14% faster)

def test_fingerprint_parameters_preserved():
    # Changing parameter name should not change fingerprint
    code1 = "def foo(a):\n    return a * 2"
    code2 = "def foo(x):\n    return x * 2"
    codeflash_output = get_code_fingerprint(code1) # 120μs -> 109μs (10.1% faster)

def test_fingerprint_different_code_same_result():
    # Different code (not just variable names) should yield different fingerprints
    code1 = "def foo(x):\n    return x + 1"
    code2 = "def foo(x):\n    return x + 2"
    codeflash_output = get_code_fingerprint(code1) # 117μs -> 108μs (7.98% faster)

# 2. Edge Test Cases

def test_empty_code():
    # Empty code should yield a valid fingerprint
    code = ""
    codeflash_output = get_code_fingerprint(code); fingerprint = codeflash_output # 35.8μs -> 33.0μs (8.76% faster)

def test_only_comments():
    # Only comments should yield same fingerprint as empty code
    code1 = ""
    code2 = "# this is a comment\n# another comment"
    codeflash_output = get_code_fingerprint(code1) # 34.9μs -> 32.5μs (7.63% faster)

def test_only_docstring():
    # Only docstring should yield same fingerprint as empty code
    code1 = ""
    code2 = '"""Module docstring"""'
    codeflash_output = get_code_fingerprint(code1) # 34.6μs -> 32.5μs (6.54% faster)

def test_syntax_error():
    # Invalid syntax should raise ValueError
    code = "def foo("
    with pytest.raises(ValueError):
        get_code_fingerprint(code) # 21.7μs -> 22.3μs (2.87% slower)

def test_unicode_identifiers():
    # Unicode variable names should normalize
    code1 = "def foo(α):\n    β = α + 1\n    return β"
    code2 = "def foo(x):\n    y = x + 1\n    return y"
    codeflash_output = get_code_fingerprint(code1) # 161μs -> 146μs (9.79% faster)

def test_attribute_access_not_normalized():
    # Attribute names should not be normalized
    code1 = "def foo(x):\n    return x.attr"
    code2 = "def foo(x):\n    return x.other"
    codeflash_output = get_code_fingerprint(code1) # 109μs -> 101μs (7.90% faster)

def test_nested_functions():
    # Nested functions, variable names normalized in inner scopes
    code1 = "def outer(a):\n    def inner(b):\n        return b + a\n    return inner(a)"
    code2 = "def outer(x):\n    def inner(y):\n        return y + x\n    return inner(x)"
    codeflash_output = get_code_fingerprint(code1) # 173μs -> 159μs (9.27% faster)

def test_class_methods_variable_normalization():
    # Variable names inside methods normalized, method names preserved
    code1 = "class Foo:\n    def bar(self, x):\n        y = x + 1\n        return y"
    code2 = "class Foo:\n    def bar(self, a):\n        b = a + 1\n        return b"
    codeflash_output = get_code_fingerprint(code1) # 165μs -> 151μs (9.24% faster)

def test_async_functions():
    # Async function variable names normalized
    code1 = "async def foo(x):\n    y = await bar(x)\n    return y"
    code2 = "async def foo(a):\n    b = await bar(a)\n    return b"
    codeflash_output = get_code_fingerprint(code1) # 147μs -> 134μs (9.84% faster)

def test_multiple_assignments():
    # Multiple assignments, all variable names normalized
    code1 = "def foo(x):\n    a = x + 1\n    b = a * 2\n    return b"
    code2 = "def foo(y):\n    c = y + 1\n    d = c * 2\n    return d"
    codeflash_output = get_code_fingerprint(code1) # 178μs -> 160μs (11.4% faster)

def test_global_and_nonlocal():
    # Global/nonlocal statements, variable names normalized
    code1 = "def foo():\n    global a\n    a = 1"
    code2 = "def foo():\n    global x\n    x = 1"
    codeflash_output = get_code_fingerprint(code1) # 111μs -> 103μs (8.47% faster)
    code3 = "def outer():\n    a = 0\n    def inner():\n        nonlocal a\n        a = 1"
    code4 = "def outer():\n    x = 0\n    def inner():\n        nonlocal x\n        x = 1"
    codeflash_output = get_code_fingerprint(code3) # 142μs -> 126μs (12.8% faster)

def test_lambda_variable_normalization():
    # Variables in lambdas normalized
    code1 = "f = lambda x: x + 1"
    code2 = "f = lambda y: y + 1"
    codeflash_output = get_code_fingerprint(code1) # 120μs -> 109μs (10.2% faster)

def test_list_comprehension_variable_normalization():
    # Variables in comprehensions normalized
    code1 = "def foo(x):\n    return [a for a in x]"
    code2 = "def foo(x):\n    return [b for b in x]"
    codeflash_output = get_code_fingerprint(code1) # 131μs -> 121μs (8.47% faster)

def test_parameter_default_values():
    # Default values with variable names normalized
    code1 = "def foo(x=1):\n    return x"
    code2 = "def foo(y=1):\n    return y"
    codeflash_output = get_code_fingerprint(code1) # 108μs -> 101μs (7.03% faster)

def test_multiline_and_indentation():
    # Multiline code with varying indentation should not affect fingerprint
    code1 = """
def foo(x):
    y = x + 1
    return y
"""
    code2 = "def foo(x):\n    y = x + 1\n    return y"
    codeflash_output = get_code_fingerprint(code1) # 146μs -> 133μs (9.59% faster)

def test_decorators_preserved():
    # Decorator names should be preserved
    code1 = "@decorator\ndef foo(x):\n    return x"
    code2 = "@other_decorator\ndef foo(x):\n    return x"
    codeflash_output = get_code_fingerprint(code1) # 104μs -> 96.9μs (7.94% faster)

# 3. Large Scale Test Cases

def test_large_number_of_variables():
    # Large number of variables, all normalized
    code1 = "def foo():\n" + "\n".join([f"    a{i} = {i}" for i in range(500)]) + "\n    return a0"
    code2 = "def foo():\n" + "\n".join([f"    x{i} = {i}" for i in range(500)]) + "\n    return x0"
    codeflash_output = get_code_fingerprint(code1) # 8.60ms -> 7.17ms (20.0% faster)

def test_large_code_block():
    # Large code block with many lines, fingerprint should be deterministic
    code = "\n".join([f"def func{i}(x):\n    y = x + {i}\n    return y" for i in range(100)])
    codeflash_output = get_code_fingerprint(code); fingerprint = codeflash_output # 6.78ms -> 5.73ms (18.3% faster)

def test_large_code_variable_name_changes():
    # Large code block, variable names changed, fingerprint should remain same
    code1 = "\n".join([f"def func{i}(a):\n    b = a + {i}\n    return b" for i in range(100)])
    code2 = "\n".join([f"def func{i}(x):\n    y = x + {i}\n    return y" for i in range(100)])
    codeflash_output = get_code_fingerprint(code1) # 6.80ms -> 5.69ms (19.4% faster)

def test_large_code_function_name_changes():
    # Large code block, function names changed, fingerprint should change
    code1 = "\n".join([f"def func{i}(x):\n    y = x + {i}\n    return y" for i in range(100)])
    code2 = "\n".join([f"def f{i}(x):\n    y = x + {i}\n    return y" for i in range(100)])
    codeflash_output = get_code_fingerprint(code1) # 6.77ms -> 5.70ms (18.9% faster)

def test_performance_large_code():
    # Large code block, should compute fingerprint quickly (<1s)
    import time
    code = "\n".join([f"def func{i}(x):\n    y = x + {i}\n    return y" for i in range(500)])
    start = time.time()
    codeflash_output = get_code_fingerprint(code); fingerprint = codeflash_output # 33.2ms -> 27.5ms (20.9% faster)
    duration = time.time() - start
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import ast
import hashlib

# imports
import pytest  # used for our unit tests
from codeflash.code_utils.deduplicate_code import get_code_fingerprint

# unit tests

# --- BASIC TEST CASES ---

def test_identical_code_gives_same_fingerprint():
    """Test that identical code strings produce identical fingerprints."""
    code1 = "def foo():\n    return 42"
    code2 = "def foo():\n    return 42"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 102μs -> 95.1μs (7.70% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 77.7μs -> 70.9μs (9.61% faster)

def test_different_code_gives_different_fingerprint():
    """Test that different code strings produce different fingerprints."""
    code1 = "def foo():\n    return 42"
    code2 = "def foo():\n    return 43"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 92.6μs -> 86.7μs (6.80% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 73.8μs -> 66.8μs (10.4% faster)

def test_variable_name_normalization():
    """Test that variable renaming does not affect the fingerprint."""
    code1 = "def foo():\n    x = 1\n    return x"
    code2 = "def foo():\n    y = 1\n    return y"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 118μs -> 109μs (8.91% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 98.4μs -> 87.8μs (12.1% faster)

def test_whitespace_and_comments_ignored():
    """Test that whitespace and comments do not affect the fingerprint."""
    code1 = "def foo():\n    x = 1\n    return x"
    code2 = "def foo():\n    x=1 # assign\n    return x"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 114μs -> 105μs (8.44% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 96.5μs -> 86.2μs (12.0% faster)

def test_docstring_removal():
    """Test that docstrings are ignored in fingerprint."""
    code1 = 'def foo():\n    """Docstring"""\n    x = 1\n    return x'
    code2 = "def foo():\n    x = 1\n    return x"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 122μs -> 111μs (10.1% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 95.7μs -> 86.5μs (10.6% faster)

def test_multiple_functions():
    """Test that multiple functions with same logic produce same fingerprint."""
    code1 = "def foo():\n    x = 1\n    return x\ndef bar():\n    y = 2\n    return y"
    code2 = "def foo():\n    a = 1\n    return a\ndef bar():\n    b = 2\n    return b"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 171μs -> 152μs (12.0% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 147μs -> 132μs (11.1% faster)

def test_function_and_class_names_preserved():
    """Test that changing function/class names affects fingerprint."""
    code1 = "def foo():\n    x = 1\n    return x"
    code2 = "def bar():\n    x = 1\n    return x"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 113μs -> 104μs (8.94% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 94.4μs -> 85.0μs (11.1% faster)

def test_parameter_names_preserved():
    """Test that changing parameter names affects fingerprint."""
    code1 = "def foo(a):\n    return a"
    code2 = "def foo(b):\n    return b"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 95.3μs -> 88.4μs (7.74% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 77.5μs -> 70.0μs (10.7% faster)

def test_attribute_names_preserved():
    """Test that changing attribute names affects fingerprint."""
    code1 = "class C:\n    def foo(self):\n        self.x = 1"
    code2 = "class C:\n    def foo(self):\n        self.y = 1"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 133μs -> 124μs (7.34% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 114μs -> 104μs (9.26% faster)

# --- EDGE TEST CASES ---

def test_empty_code():
    """Test that empty code returns a valid fingerprint (not error)."""
    code = ""
    codeflash_output = get_code_fingerprint(code); fp = codeflash_output # 35.7μs -> 33.0μs (8.00% faster)

def test_only_comments():
    """Test code with only comments returns same fingerprint as empty code."""
    code1 = ""
    code2 = "# just a comment\n# another comment"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 34.9μs -> 31.9μs (9.45% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 23.0μs -> 21.7μs (6.38% faster)

def test_syntax_error_raises():
    """Test that invalid Python code raises ValueError."""
    code = "def foo(:"
    with pytest.raises(ValueError):
        get_code_fingerprint(code) # 30.6μs -> 31.3μs (2.17% slower)

def test_unicode_and_non_ascii():
    """Test code with unicode identifiers and string literals."""
    code1 = "def foo():\n    x = '你好'\n    return x"
    code2 = "def foo():\n    y = '你好'\n    return y"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 123μs -> 112μs (9.05% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 102μs -> 92.1μs (10.8% faster)

def test_global_and_nonlocal_statements():
    """Test that global/nonlocal statements are handled."""
    code1 = "def foo():\n    global x\n    x = 1"
    code2 = "def foo():\n    global y\n    y = 1"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 111μs -> 103μs (7.82% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 92.3μs -> 83.6μs (10.4% faster)

def test_for_loop_variable_normalization():
    """Test that for loop variable names are normalized."""
    code1 = "for x in range(5):\n    print(x)"
    code2 = "for y in range(5):\n    print(y)"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 125μs -> 113μs (10.8% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 104μs -> 90.9μs (15.1% faster)

def test_comprehension_variable_normalization():
    """Test that comprehension variable names are normalized."""
    code1 = "[x for x in range(5)]"
    code2 = "[y for y in range(5)]"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 112μs -> 103μs (8.67% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 95.0μs -> 83.0μs (14.4% faster)

def test_nested_functions_and_classes():
    """Test nested function/class variable normalization."""
    code1 = "class C:\n    def foo(self):\n        x = 1\n        return x"
    code2 = "class C:\n    def foo(self):\n        y = 1\n        return y"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 141μs -> 129μs (8.50% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 120μs -> 110μs (9.28% faster)

def test_lambda_variable_normalization():
    """Test that lambda variable names are normalized."""
    code1 = "f = lambda x: x + 1"
    code2 = "f = lambda y: y + 1"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 121μs -> 109μs (11.0% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 100μs -> 90.1μs (11.9% faster)

def test_del_statement():
    """Test that variable names in del statements are normalized."""
    code1 = "del x"
    code2 = "del y"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 55.8μs -> 51.1μs (9.01% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 39.7μs -> 35.5μs (11.7% faster)

def test_multiple_assignments():
    """Test that multiple assignments with different variable names are normalized."""
    code1 = "x, y = 1, 2"
    code2 = "a, b = 1, 2"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 104μs -> 95.3μs (9.88% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 86.8μs -> 76.3μs (13.8% faster)

def test_augmented_assignment():
    """Test augmented assignment variable normalization."""
    code1 = "x += 1"
    code2 = "y += 1"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 68.3μs -> 61.5μs (11.0% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 51.5μs -> 45.7μs (12.9% faster)

def test_variable_shadowing():
    """Test that variable shadowing is normalized."""
    code1 = "def foo():\n    x = 1\n    return x\ndef bar():\n    x = 2\n    return x"
    code2 = "def foo():\n    y = 1\n    return y\ndef bar():\n    z = 2\n    return z"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 172μs -> 156μs (9.78% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 150μs -> 133μs (12.4% faster)

# --- LARGE SCALE TEST CASES ---

def test_large_number_of_variables():
    """Test normalization with many variables (scalability)."""
    code1 = "\n".join([f"x{i} = {i}" for i in range(500)]) + "\n" + \
            "total = 0\n" + \
            "\n".join([f"total += x{i}" for i in range(500)]) + "\nprint(total)"
    code2 = "\n".join([f"y{i} = {i}" for i in range(500)]) + "\n" + \
            "total = 0\n" + \
            "\n".join([f"total += y{i}" for i in range(500)]) + "\nprint(total)"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 16.6ms -> 13.4ms (24.6% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 16.6ms -> 13.3ms (24.5% faster)

def test_large_function_body():
    """Test normalization with a large function body."""
    func_body1 = "\n".join([f"    x{i} = {i}" for i in range(300)]) + "\n    return x299"
    func_body2 = "\n".join([f"    y{i} = {i}" for i in range(300)]) + "\n    return y299"
    code1 = "def foo():\n" + func_body1
    code2 = "def foo():\n" + func_body2
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 5.20ms -> 4.34ms (19.9% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 5.16ms -> 4.30ms (20.1% faster)

def test_large_class_with_many_methods():
    """Test normalization for a class with many methods and variables."""
    methods1 = "\n".join([f"    def m{i}(self):\n        x = {i}\n        return x" for i in range(100)])
    methods2 = "\n".join([f"    def m{i}(self):\n        y = {i}\n        return y" for i in range(100)])
    code1 = f"class C:\n{methods1}"
    code2 = f"class C:\n{methods2}"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 5.49ms -> 4.69ms (17.0% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 5.45ms -> 4.66ms (16.9% faster)

def test_large_comprehension():
    """Test normalization for large comprehensions."""
    code1 = "[x for x in range(1000)]"
    code2 = "[y for y in range(1000)]"
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 120μs -> 110μs (9.20% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 94.6μs -> 85.2μs (11.1% faster)

def test_large_file_with_comments_and_docstrings():
    """Test normalization for a large file with many comments and docstrings."""
    code1 = "\n".join([
        '"""Module docstring"""',
        "# This is a comment",
        "def foo():",
        '    """Function docstring"""',
        "    x = 1  # assign",
        "    return x",
        "# Another comment",
        "def bar():",
        '    """Another docstring"""',
        "    y = 2",
        "    return y"
    ] * 80)  # total ~800 lines
    code2 = "\n".join([
        '"""Module docstring"""',
        "# This is a comment",
        "def foo():",
        '    """Function docstring"""',
        "    a = 1  # assign",
        "    return a",
        "# Another comment",
        "def bar():",
        '    """Another docstring"""',
        "    b = 2",
        "    return b"
    ] * 80)
    codeflash_output = get_code_fingerprint(code1); fp1 = codeflash_output # 9.22ms -> 7.71ms (19.5% faster)
    codeflash_output = get_code_fingerprint(code2); fp2 = codeflash_output # 9.15ms -> 7.68ms (19.2% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from codeflash.code_utils.deduplicate_code import get_code_fingerprint
import pytest

def test_get_code_fingerprint():
    with pytest.raises(TypeError, match='compile\\(\\)\\ arg\\ 1\\ must\\ be\\ a\\ string,\\ bytes\\ or\\ AST\\ object'):
        get_code_fingerprint('')

To edit these changes, run `git checkout codeflash/optimize-pr733-2025-09-13T23.54.19` and push.


codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Sep 13, 2025

codeflash-ai bot mentioned this pull request Sep 13, 2025

deduplicate optimizations better #733

Merged

misrasaurabh1 closed this Sep 14, 2025

codeflash-ai bot deleted the codeflash/optimize-pr733-2025-09-13T23.54.19 branch September 14, 2025 00:20
